Abstract

This paper presents the formulation of a combinatorial optimization problem with the following
characteristics: (i) the search space is the power set of a finite set, structured as a Boolean lattice;
(ii) the cost function forms a U-shaped curve when applied to any lattice chain. This formulation
applies to feature selection in the context of pattern recognition. The known approaches to this
problem are branch-and-bound algorithms and heuristics that explore the search space only partially.
Branch-and-bound algorithms are equivalent to the full search, while heuristics are not. This paper
presents a branch-and-bound algorithm that differs from the known ones by exploiting the lattice
structure and the U-shaped chain curves of the search space. The main contribution of this paper is
the architecture of this algorithm, which is based on the representation and exploration of the search
space by new lattice properties proven here. Several experiments with well-known public data indicate
the superiority of the proposed method over SFFS, a popular heuristic that gives good results in very
short computational time. In all experiments, the proposed method obtained better or equal results in
similar or even smaller computational time.

Index Terms 

Boolean lattice; branch-and-bound algorithm; U-shaped curve; classifiers; W-operators; feature 
selection; subset search; optimal search. 

I. INTRODUCTION 

A combinatorial optimization algorithm chooses the object of minimum cost over a finite
collection of objects, called the search space, according to a given cost function. The simplest
architecture for this algorithm, called full search, accesses each object of the search space, but
it does not work for huge spaces. In this case, what is possible is to access some objects
and choose the one of minimum cost, based on the observed measures. Heuristics and
branch-and-bound are two families of algorithms of this kind. A heuristic algorithm has no
formal guarantee of finding the minimum cost object, while a branch-and-bound algorithm has
mathematical properties that guarantee finding it.

Here, we study a combinatorial optimization problem in which the search space is composed
of all subsets of a finite set with $n$ points (i.e., a search space with $2^n$ objects), organized as
a Boolean lattice, and the cost function has a U-shape in any chain of the search space or,
equivalently, in any maximal chain of the search space.
This structure is found in some applied problems, such as feature selection in pattern
recognition [5], [7] and W-operator window design in mathematical morphology [8]. In these problems,
a minimum subset of features that is sufficient to represent the objects should be chosen from
a set of $n$ features. In W-operator design, the features are points of a finite rectangle of $\mathbb{Z}^2$
called window. The U-shaped functions are formed by error estimation of the designed classifiers
or operators, or by some measures, such as the entropy, on the corresponding estimated joint
distribution. This is a well-known phenomenon in pattern recognition: for a fixed amount of
training data, increasing the number of features considered in the classifier design first reduces
the classifier error, by increasing the separation between classes, until the available
data becomes too small to cover the classifier domain; from then on, the consequent increase of the
estimation error increases the classifier error. Some known approaches to this
problem are heuristics. A relatively successful heuristic algorithm is SFFS [11], which
gives good results in relatively small computational time.

There is a myriad of branch-and-bound algorithms in the literature that are based on the
monotonicity of the cost function [6], [10], [14], [15]. For a detailed review of branch-and-bound
algorithms, refer to [13]. If the real joint probability distribution of the patterns
and their classes were known, larger dimensionality would imply smaller classification errors.
However, in practice, these distributions are unknown and must be estimated. A problem with
the adoption of monotonic cost functions is that they do not take into account the estimation
errors committed when many features are considered (the "curse of dimensionality", also known as
the "U-curve problem" or "peaking phenomenon" [7]).

This paper presents a branch-and-bound algorithm that differs from the others known by 
exploring the lattice structure and the U-shaped chain curves of the search space. 

Some experiments were performed to compare SFFS to the U-curve approach. Results
obtained from applications such as W-operator window design, genetic network architecture
identification and eight UCI repository data sets are encouraging: the U-curve
algorithm beats SFFS (i.e., finds a node with smaller cost than the one found by SFFS)
in smaller computational time for 27 out of the 38 data sets tested. For all data sets, the
U-curve algorithm gives a result equal to or better than that of SFFS, since the former covers
the complete search space.

Though the results obtained with the application of the method developed to pattern recognition 



October 30, 2008 



DRAFT 



A BRANCH-AND-BOUND OPTIMIZATION ALGORITHM FOR U-SHAPED COST FUNCTIONS ON BOOLEAN LATTICES 4 

problems are exciting, the main contribution of this paper is the discovery of some lattice algebra
properties that lead to a new data structure for the search space representation, one that is
particularly adequate for updates after up-down lattice interval cuts (i.e., cuts by pairs of
intervals $[\emptyset, X]$ and $[X, W]$). Classical tree-based search space representations do not have
this property. For example, if Depth First Search were adopted to represent the Boolean lattice,
only cuts in one direction could be performed.

Following this introduction, Section 2 presents the formalization of the problem studied.
Section 3 describes the structure of the branch-and-bound algorithm designed. Section 4 presents the
mathematical properties that support the algorithm steps. Section 5 presents some experimental
results comparing U-curve to SFFS. Finally, the Conclusion discusses the contributions of this paper
and proposes some next steps for this research.

II. The Boolean U-curve optimization problem 

Let $W$ be a finite set, $\mathcal{P}(W)$ be the collection of all subsets of $W$, $\subseteq$ be the usual
inclusion relation on sets and $|W|$ denote the cardinality of $W$. The search space is composed of
$2^{|W|}$ objects organized in a Boolean lattice.

The partially ordered set $(\mathcal{P}(W), \subseteq)$ is a complete Boolean lattice of degree $|W|$ such
that: the smallest and largest elements are, respectively, $\emptyset$ and $W$; the sum and product are,
respectively, the usual union and intersection on sets; and the complement of a set $X$ in $\mathcal{P}(W)$
is its complement in relation to $W$, denoted by $X^c$.

Subsets of $W$ will be represented by strings of zeros and ones, with 0 meaning that the point
does not belong to the subset and 1 meaning that it does. For example, if $W = \{(-1, 0), (0, 0),
(+1, 0)\}$, the subset $\{(-1, 0), (0, 0)\}$ will be represented by 110. In an abuse of language,
$X = 110$ means that $X$ is the set represented by 110.
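As a minimal sketch of this encoding, the Python snippet below converts between a subset and its 0/1 string. The function names `subset_to_bits` and `bits_to_subset` are hypothetical, and the ordering of the points of $W$ is assumed to be the order in which they are listed.

```python
def subset_to_bits(subset, universe):
    """Return the 0/1 string of `subset`, one character per point of `universe`."""
    members = set(subset)
    return "".join("1" if point in members else "0" for point in universe)


def bits_to_subset(bits, universe):
    """Inverse mapping: recover the subset encoded by a 0/1 string."""
    return {point for point, bit in zip(universe, bits) if bit == "1"}


# The window and subset used in the example of the text.
W = [(-1, 0), (0, 0), (1, 0)]
X = {(-1, 0), (0, 0)}
encoded = subset_to_bits(X, W)
print(encoded)
```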

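A chain $\mathcal{A}$ is a collection $\{A_1, A_2, \ldots, A_k\} \subseteq X \subseteq \mathcal{P}(W)$ such that
$A_1 \subseteq A_2 \subseteq \ldots \subseteq A_k$. A chain $\mathcal{M} \subseteq X$ is maximal in $X$ if there is no
other chain $\mathcal{C} \subseteq X$ such that $\mathcal{C}$ properly contains $\mathcal{M}$.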

Let $c$ be a cost function defined from $\mathcal{P}(W)$ to $\mathbb{R}$. We say that $c$ is decomposable in
U-shaped curves if, for every maximal chain $\mathcal{M} \subseteq \mathcal{P}(W)$, the restriction of $c$ to
$\mathcal{M}$ is a U-shaped curve, i.e., for every $A, X, B \in \mathcal{M}$,
$A \subseteq X \subseteq B \Rightarrow \max(c(A), c(B)) \geq c(X)$.
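The definition can be checked by brute force on a small lattice. The sketch below assumes subsets are encoded as integer bitmasks (bit $i$ set when point $i$ belongs to the subset) and tests the inequality on every nested triple $A \subseteq X \subseteq B$; since every such triple lies on some maximal chain, this is equivalent to testing every maximal chain. The names `is_u_shaped`, `toy_cost` and `peaked_cost` are illustrative only, and the exhaustive loop is practical only for a handful of points.

```python
def is_subset(a, b):
    """True when the bitmask a encodes a subset of the bitmask b."""
    return a & b == a


def is_u_shaped(cost, n):
    """Check max(c(A), c(B)) >= c(X) for every A <= X <= B in the lattice of n points."""
    elements = range(1 << n)
    for a in elements:
        for x in elements:
            if not is_subset(a, x):
                continue
            for b in elements:
                if is_subset(x, b) and max(cost(a), cost(b)) < cost(x):
                    return False
    return True


def toy_cost(x):
    """Hypothetical cost: distance of the subset size from 2 (a valley at size 2)."""
    return abs(bin(x).count("1") - 2)


def peaked_cost(x):
    """Hypothetical counterexample: the same curve turned upside down."""
    return -toy_cost(x)


print(is_u_shaped(toy_cost, 4), is_u_shaped(peaked_cost, 4))
```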
Fig. 1. A complete Boolean lattice $\mathcal{L}$ of degree 4 and the cost function decomposable in U-shaped curves.
$X = \mathcal{L} - \{0000, 0010, 0001, 1110, 1111\}$ is a poset obtained from $\mathcal{L}$. A maximal chain in $\mathcal{L}$ is
emphasized. The element 0111 is the global minimum element and 0101 is the local minimum element in the maximal chain.



Figure 1 shows a complete Boolean lattice $\mathcal{L}$ of degree 4 with a cost function $c$ decomposable
in U-shaped curves. In this figure, a maximal chain in $\mathcal{L}$ and its cost function are emphasized.
Figure 2 presents the curve of the same cost function restricted to some maximal chains in $\mathcal{L}$
and in $X \subseteq \mathcal{L}$. Note the U-shape of the curves in Figure 2.

Our problem is to find the element (or elements) of minimum cost in a Boolean lattice of
degree $|W|$. The full search in this space is an exponential problem, since this space is composed
of $2^{|W|}$ elements. Thus, for moderately large $|W|$, the full search becomes infeasible.
Fig. 2. The four possible representations of the cost function $c$ restricted to some maximal chains in $\mathcal{L}$ (a) and in
$X \subseteq \mathcal{L}$ (b-d) of Figure 1.



III. The U-curve algorithm 

The U-shaped format of the restriction of the cost function to any maximal chain is the
key to developing a branch-and-bound algorithm, the U-curve algorithm, to deal with the hard
combinatorial problem of finding subsets of minimum cost.

Let $A$ and $B$ be elements of the Boolean lattice $\mathcal{L}$. An interval $[A, B]$ of $\mathcal{L}$ is the subset
of $\mathcal{L}$ given by $[A, B] = \{X \in \mathcal{L} : A \subseteq X \subseteq B\}$. The elements $A$ and $B$ are called,
respectively, the left and right extremities of $[A, B]$. Intervals are very important for
characterizing decompositions in Boolean lattices [2], [4].

Let $R$ be an element of $\mathcal{L}$. In this paper, intervals of the type $[\emptyset, R]$ and $[R, W]$ are called,
respectively, lower and upper intervals. The right extremity of a lower interval and the left
extremity of an upper interval are called, respectively, lower and upper restrictions. Let $\mathcal{R}_L$ and
$\mathcal{R}_U$ denote, respectively, collections of lower and upper restrictions. The search space will be the
poset $X(\mathcal{R}_L, \mathcal{R}_U)$ obtained by eliminating from $\mathcal{L}$ the lower and upper intervals defined by
these restrictions, i.e.,
$X(\mathcal{R}_L, \mathcal{R}_U) = \mathcal{L} - \bigcup\{[\emptyset, R] : R \in \mathcal{R}_L\} - \bigcup\{[R, W] : R \in \mathcal{R}_U\}$.
In cases in which only the lower or only the upper intervals are eliminated, the resulting search
space is denoted, respectively, by $X(\mathcal{R}_L)$ and $X(\mathcal{R}_U)$ and given, respectively, by
$X(\mathcal{R}_L) = \mathcal{L} - \bigcup\{[\emptyset, R] : R \in \mathcal{R}_L\}$ and
$X(\mathcal{R}_U) = \mathcal{L} - \bigcup\{[R, W] : R \in \mathcal{R}_U\}$.
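To make the definition concrete, the following sketch tests membership in the restricted search space. It assumes subsets are stored as integer bitmasks, so that inclusion reduces to a bitwise test; the names `in_search_space` and `search_space` are hypothetical, and enumerating the whole lattice is meant only for tiny examples.

```python
def in_search_space(x, lower, upper):
    """True when x survives all lower intervals [empty, R] and upper intervals [R, W]."""
    covered_below = any(x & r == x for r in lower)  # x is contained in some lower restriction
    covered_above = any(r & x == r for r in upper)  # x contains some upper restriction
    return not covered_below and not covered_above


def search_space(n, lower, upper):
    """Enumerate the restricted poset explicitly for a lattice of degree n."""
    return [x for x in range(1 << n) if in_search_space(x, lower, upper)]


# Degree-3 example: remove [000, 100] and [011, 111].
remaining = search_space(3, lower=[0b100], upper=[0b011])
print([format(x, "03b") for x in remaining])
```

Storing only the restrictions, rather than the surviving elements, keeps the representation small even when the lattice itself is too large to enumerate.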

The search space is explored by an iterative algorithm that, at each iteration, explores a small
subset of $X(\mathcal{R}_L, \mathcal{R}_U)$, computes a local minimum, updates the list of minimum elements found
and extends both restriction sets, eliminating the region just explored. The algorithm is initiated
with three empty lists: minimum elements, lower restrictions and upper restrictions. It is executed
until the whole space is explored, i.e., until $X(\mathcal{R}_L, \mathcal{R}_U)$ becomes empty. The subset of
$X(\mathcal{R}_L, \mathcal{R}_U)$ eliminated at each iteration is defined from the exploration of a chain, which
may be done in the down-up or up-down direction. Algorithm 1 describes this process. The direction
selection procedure (line 5) can use a random or an adaptive method. The random method sets a static
probability of selecting the down-up or up-down direction. The adaptive method recalculates the
probability of each direction, giving more probability to the down-up direction if most of the local
minima found are closer to the bottom of the lattice, and to the up-down direction otherwise.

An element $C$ of the poset $X \subseteq \mathcal{L}$ is called a minimal element of $X$ if there is no other
element $C'$ of $X$ with $C' \subseteq C$. In Figure 1, the minimal elements of $X(\mathcal{R}_L)$ are: 1000, 0100
and 0011. If the down-up direction is chosen, the Down-Up-Direction procedure is performed
(Algorithm 2):

• The Minimal-Element procedure calculates a minimal element $B$ of the poset $X(\mathcal{R}_L)$. Only the
lower restriction set is used to calculate the minimal element $B$. An element $B$ is said to be
covered by the lower restriction set $\mathcal{R}_L$ if $\exists R \in \mathcal{R}_L : B \subseteq R$, and $B$ is said to be
covered by the upper restriction set $\mathcal{R}_U$ if $\exists R \in \mathcal{R}_U : R \subseteq B$. When the calculated $B$
is covered by an upper restriction, it is discarded, i.e., the lower restriction set is updated
with $B$ and a new iteration begins (lines 1-5).

• The down-up direction chain exploration procedure begins with a minimal element $B$ and
proceeds by random selection of upper adjacent elements from the current poset $X(\mathcal{R}_L, \mathcal{R}_U)$
until it finds the U-curve condition, i.e., the last element selected ($B$) has cost bigger than
the previous one ($M$) (lines 7-11).

Algorithm 1 U-curve-algorithm()
1: $M \leftarrow \emptyset$
2: $\mathcal{R}_L \leftarrow \emptyset$
3: $\mathcal{R}_U \leftarrow \emptyset$
4: while $X(\mathcal{R}_L, \mathcal{R}_U) \neq \emptyset$ do
5:    direction $\leftarrow$ Select-Direction()
6:    if direction is UP then
7:       Down-Up-Direction($\mathcal{R}_L$, $\mathcal{R}_U$)
8:    else
9:       Up-Down-Direction($\mathcal{R}_L$, $\mathcal{R}_U$)
10:   end if
11: end while

• At this point, the element $M$ is the minimum element of the chain explored, and $A$ and $B$ are,
respectively, the lower and upper adjacent elements of $M$ (i.e., $A \subset M \subset B$ and
$\{X \in \mathcal{P}(W) : A \subset X \subset M\} = \{X \in \mathcal{P}(W) : M \subset X \subset B\} = \emptyset$). By construction,
$c(A) \geq c(M)$ and $c(M) < c(B)$. It can be proved that any element $C$ of $X(\mathcal{R}_L, \mathcal{R}_U)$ with
$C \subseteq A$ has cost not smaller than that of $A$, and any element $D$ of $X(\mathcal{R}_L, \mathcal{R}_U)$ with
$B \subseteq D$ has cost not smaller than that of $B$. By using this property,
the lower and upper restrictions can be updated, respectively, by $A$ and $B$ (lines 12-17).
Figure 3 shows a schematic representation of the first iteration of the algorithm and the
elements contained in the intervals $[\emptyset, A = 1 \ldots 1010 \ldots 0]$ and $[B = 1 \ldots 11110 \ldots 0, W]$.

• The result list can be updated with $M$ (line 18), i.e., $M$ will be included in the result list
if it has cost lower than the elements already saved in the list. The result list can keep a
pre-defined number of elements with low costs or only the elements with the overall minimum
cost.

• In order to prevent visiting the element $M$ more than once, a recursive procedure called the
minimum exhausting procedure is performed (line 19).
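The chain climb in the steps above (the repeat-until loop of the Down-Up-Direction procedure) can be sketched in Python as follows. This is only an illustration under stated assumptions: subsets are integer bitmasks, `in_space` is a hypothetical predicate standing for membership in the current poset, and `climb_chain`, `upper_adjacent` and `toy_cost` are names introduced here rather than taken from the paper. The restriction updates and the minimum exhausting step are left out.

```python
import random


def upper_adjacent(x, n):
    """All supersets of x with exactly one more point."""
    return [x | (1 << i) for i in range(n) if not x & (1 << i)]


def climb_chain(start, n, cost, in_space, rng):
    """Climb from `start` until the cost rises; return (A, M, B) as in the text.

    M is the chain minimum, A the element visited before it (or None) and
    B the first element whose cost exceeds c(M) (or None if the climb ran
    out of admissible upper adjacent elements).
    """
    a, m, b = None, None, start
    while True:
        a, m = m, b
        candidates = [u for u in upper_adjacent(m, n) if in_space(u)]
        b = rng.choice(candidates) if candidates else None
        if b is None or cost(b) > cost(m):
            return a, m, b


def toy_cost(x):
    """Hypothetical U-shaped cost: distance of the subset size from 2."""
    return abs(bin(x).count("1") - 2)


rng = random.Random(0)
a, m, b = climb_chain(0, 4, toy_cost, lambda x: True, rng)
print(format(a, "04b"), format(m, "04b"), format(b, "04b"))
```

Whatever random choices are made, the toy cost forces the climb to stop one step after reaching a subset of size 2, which then plays the role of $M$.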

Algorithm 2 Down-Up-Direction(ElementSet $\mathcal{R}_L$, ElementSet $\mathcal{R}_U$)
1: $B \leftarrow$ Minimal-Element($\mathcal{R}_L$)
2: if $B$ is covered by $\mathcal{R}_U$ then
3:    Update-Lower-Restriction($B$, $\mathcal{R}_L$)
4:    return
5: end if
6: $M \leftarrow$ null
7: repeat
8:    $A \leftarrow M$
9:    $M \leftarrow B$
10:   $B \leftarrow$ Select-Upper-Adjacent($M$, $\mathcal{R}_L$, $\mathcal{R}_U$)
11: until $c(B) > c(M)$ or $B$ = null
12: if $A \neq$ null then
13:   Update-Lower-Restriction($A$, $\mathcal{R}_L$)
14: end if
15: if $B \neq$ null then
16:   Update-Upper-Restriction($B$, $\mathcal{R}_U$)
17: end if
18: Update-Results($M$)
19: Minimum-Exhausting($M$, $\mathcal{R}_L$, $\mathcal{R}_U$)

An element is called a minimum exhausted element in $\mathcal{L}$ if all its adjacent elements (upper
and lower) have cost bigger than it. This definition can be extended to the poset
$X(\mathcal{R}_L, \mathcal{R}_U)$: an element is minimum exhausted in $X(\mathcal{R}_L, \mathcal{R}_U)$ if all its adjacent
elements (upper and lower) in $X(\mathcal{R}_L, \mathcal{R}_U)$ have cost bigger than it. In
Figure 1 we can see that the elements 1010, 1001 and 0111 are minimum exhausted elements in
$X(\mathcal{R}_L, \mathcal{R}_U)$, but 1001 is not a minimum exhausted element in $\mathcal{L}$. In this paper, the term
minimum exhausted will always refer to a poset $X(\mathcal{R}_L, \mathcal{R}_U)$.

The minimum exhausting procedure (Algorithm 3) is a recursive process that visits all the
adjacent elements of a given element $M$ and turns all of them into minimum exhausted elements
in the resulting poset $X(\mathcal{R}_L, \mathcal{R}_U)$. It uses a stack $S$ to perform the recursive process. $S$ is
initialized by pushing $M$ to it and the process is performed while $S$ is not empty (lines 2-22).
Fig. 3. A schematic representation of a step of the algorithm; the highlighted areas represent the elements contained in the
lower and upper restrictions.



At each iteration, the algorithm processes the top element $T$ of $S$: all the adjacent elements
(upper and lower) of $T$ that are in $X(\mathcal{R}_L, \mathcal{R}_U)$ and not in $S$ are checked. If the cost of an
adjacent element $A$ is lower than or equal to the cost of $T$, then $A$ is pushed to $S$. If the cost
of $A$ is bigger than the cost of $T$, then one of the restriction sets can be updated with $A$: the
lower restriction set if $A$ is a lower adjacent of $T$ and the upper restriction set if $A$ is an
upper adjacent of $T$ (lines 5-16). If $T$ is a minimum exhausted element in $X(\mathcal{R}_L, \mathcal{R}_U)$, i.e.,
there is no adjacent element $A$ in $X(\mathcal{R}_L, \mathcal{R}_U)$ with cost lower than that of $T$, then $T$ is
removed from $S$ and the restriction sets and the result list are updated with $T$ (lines 19-21).
At the end of this procedure, all the elements processed are minimum exhausted elements in
$X(\mathcal{R}_L, \mathcal{R}_U)$.
Algorithm 3 Minimum-Exhausting(Element $M$, ElementSet $\mathcal{R}_L$, ElementSet $\mathcal{R}_U$)
1: Push $M$ to $S$
2: while $S$ is not empty do
3:    $T \leftarrow$ Top($S$)
4:    MinimumExhausted $\leftarrow$ true
5:    for all $A$ adjacent of $T$ in $X(\mathcal{R}_L, \mathcal{R}_U)$ and $A \notin S$ do
6:       if $c(A) \leq c(T)$ then
7:          Push $A$ to $S$
8:          MinimumExhausted $\leftarrow$ false
9:       else
10:         if $A$ is upper adjacent of $T$ then
11:            Update-Upper-Restriction($A$, $\mathcal{R}_U$)
12:         else
13:            Update-Lower-Restriction($A$, $\mathcal{R}_L$)
14:         end if
15:      end if
16:   end for
17:   if MinimumExhausted then
18:      Pop $T$ from $S$
19:      Update-Results($T$)
20:      Update-Lower-Restriction($T$, $\mathcal{R}_L$)
21:      Update-Upper-Restriction($T$, $\mathcal{R}_U$)
22:   end if
23: end while
24: return
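A compact Python rendering of the stack-based procedure may help in following its control flow. It is a sketch under simplifying assumptions rather than the authors' implementation: subsets are integer bitmasks, the restriction sets are plain lists that are only appended to (no pruning of redundant restrictions), every exhausted element is recorded in `results`, and the helper names `adjacent`, `covered` and `hamming_cost` are hypothetical.

```python
def adjacent(x, n):
    """Upper and lower adjacent elements of x: flip one of the n bits."""
    return [x ^ (1 << i) for i in range(n)]


def covered(x, lower, upper):
    """True when x lies in some lower interval [empty, R] or upper interval [R, W]."""
    return any(x & r == x for r in lower) or any(r & x == r for r in upper)


def minimum_exhausting(m, n, cost, lower, upper, results):
    """Exhaust the neighbourhood of m, extending `lower`, `upper` and `results` in place."""
    stack = [m]
    while stack:
        t = stack[-1]
        exhausted = True
        for a in adjacent(t, n):
            if a in stack or covered(a, lower, upper):
                continue  # already scheduled or already cut from the search space
            if cost(a) <= cost(t):
                stack.append(a)  # may still lead to a lower cost: process later
                exhausted = False
            elif a & t == t:
                upper.append(a)  # a is an upper adjacent with bigger cost
            else:
                lower.append(a)  # a is a lower adjacent with bigger cost
        if exhausted:
            stack.pop()
            results.append(t)
            lower.append(t)
            upper.append(t)


def hamming_cost(x):
    """Hypothetical cost: number of bits in which x differs from 101."""
    return bin(x ^ 0b101).count("1")


lower, upper, results = [], [], []
minimum_exhausting(0b000, 3, hamming_cost, lower, upper, results)
best = min(results, key=hamming_cost)
print(format(best, "03b"))
```

After the call, every processed element is covered by both restriction lists, which is what prevents it from being visited again in later iterations.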



Figure 4 shows a graphical representation of the minimum exhausting process. Figure 4-A shows
a chain construction process in the up direction; the chain has its edges emphasized. The element
$M = 010101$ (orange-colored) has the minimum cost over the chain. The elements in black
are the elements eliminated from the search space by the restrictions obtained from the lower and
upper adjacent elements of the local minimum $M$. The stack begins with the element $M$. Figure
4-B shows the first iteration of the minimum exhausting process. The arrows in red and the
elements in red indicate the adjacent elements of $M$ (top of the stack) that have cost lower than
or equal to it. These elements, 010001 and 010111, are pushed to the stack. The adjacent
elements of $M$ with cost bigger than it can update the restriction sets, i.e., the lower adjacent
element 000101 updates the lower restriction set and the upper adjacent element 000101 updates
the upper restriction set. Figure 4-C shows the second iteration: the adjacent elements 010011
and 000111, with cost lower than or equal to that of the new top element 010111, are pushed to the
stack and the other adjacent elements 010110 and 011111, with cost bigger than that of 010111, update,
respectively, the lower and upper restriction sets. In Figure 4-D the element 000111 is a minimum
exhausted element (grey color) in $X(\mathcal{R}_L, \mathcal{R}_U)$ and it is removed from the stack. In Figure 4-E
the elements eliminated by the new intervals $[\emptyset, 000111]$ and $[000111, W]$ are turned black.
At this point, 010011 is a minimum exhausted element (grey color) in $X(\mathcal{R}_L, \mathcal{R}_U)$ and it is
removed from the stack. From Figure 4-F to Figure 4-H all the elements are removed from the stack
and the elements removed by the new restrictions are turned black. Figure 4-H shows all the
elements removed by a single minimum exhausting process.

The procedures to calculate minimal and maximal elements and the procedure to update lower 
and upper restriction sets will be discussed in the next section. 

IV. Mathematical foundations 
This section introduces mathematical foundations of some modules of the algorithm. 

A. Minimal and Maximal Construction Procedure 

Each iteration of the algorithm requires the calculation of a minimal element in $X(\mathcal{R}_L)$ or a
maximal element in $X(\mathcal{R}_U)$. A simple solution for that is presented here. The next theorem
is the key for this solution.
Theorem 1. For every $A \in \mathcal{P}(W)$,

$A \in X(\mathcal{R}_L) \Leftrightarrow A \cap R^c \neq \emptyset, \ \forall R \in \mathcal{R}_L$.

Proof: (in Appendix Section)
Algorithm 4 implements the minimal construction procedure. It builds a minimal element $C$ of
the poset $X(\mathcal{R}_L)$. The process begins with $C = (1 \ldots 1)$ and $S = (1 \ldots 1)$, both with $n$
components, and executes a loop of $n$ iterations (lines 3-16) trying to remove components from $C$.
At each step, a component $k$, $k \in \{1, \ldots, n\}$, is chosen exclusively from $S$ ($S$ prevents
selecting the same component more than once). If the element $C'$ obtained from $C$ by
removing the component $k$ is contained in $X(\mathcal{R}_L)$, then $C$ is updated with $C'$ (lines 7-15).



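Algorithm 4 Minimal-Element(ElementSet $\mathcal{R}_L$)
1: $C \leftarrow (1 \ldots 1)$ ($n$ components)
2: $S \leftarrow (1 \ldots 1)$ ($n$ components)
3: while $S \neq (0 \ldots 0)$ do
4:    $k \leftarrow$ random index in $\{1, \ldots, n\}$ where $S[k] = 1$
5:    $S[k] \leftarrow 0$
6:    $C' \leftarrow C \setminus k$
7:    RemoveElement $\leftarrow$ true
8:    for all $R$ in $\mathcal{R}_L$ do
9:       if $R^c \cap C' = \emptyset$ then
10:         RemoveElement $\leftarrow$ false
11:      end if
12:   end for
13:   if RemoveElement then
14:      $C \leftarrow C'$
15:   end if
16: end while
17: return $C$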



The minimal element calculated is equal to $(1 \ldots 1)$ (with $n$ components) when
$\mathcal{R}_L = \{(1 \ldots 1)\}$. At this point, the poset $X(\mathcal{R}_L, \mathcal{R}_U)$ is empty and the algorithm
stops in the next iteration.
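The minimal construction procedure can be sketched in a few lines of Python. The sketch assumes subsets are integer bitmasks and uses the membership test of Theorem 1 (an element belongs to the poset when it intersects the complement of every lower restriction); the function name `minimal_element` and the explicit random generator argument are choices made here for illustration.

```python
import random


def minimal_element(n, lower, rng):
    """Greedily strip points from the full set while staying outside every [empty, R]."""
    full = (1 << n) - 1
    c = full
    order = list(range(n))
    rng.shuffle(order)  # each component is tried exactly once, in random order
    for k in order:
        candidate = c & ~(1 << k)
        # Theorem 1: candidate is in the poset iff it meets the complement of every R.
        if all(candidate & ~r & full for r in lower):
            c = candidate
    return c


rng = random.Random(0)
found = {minimal_element(3, [0b100], rng) for _ in range(50)}
print(sorted(format(x, "03b") for x in found))
```

With the single lower restriction 100 in a degree-3 lattice, the poset loses 000 and 100, so the only minimal elements are 001 and 010, and the random component order decides which of them is returned.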

The next theorem proves the correctness of Algorithm 4.
Theorem 2. The element $C$ of $X(\mathcal{R}_L)$ returned by the minimal construction process (Algorithm
4) is a minimal element in $X(\mathcal{R}_L)$.
Proof: (in Appendix Section)
The process to calculate a maximal element in $X(\mathcal{R}_U)$ is dual to the one to calculate a
minimal element, i.e., it begins with $C = (0 \ldots 0)$ (with $n$ components) and, at each step, adds
a component $k$ to $C$ when the complement $C'^c$ of the resulting $C'$ has non-empty intersection
with every element of $\mathcal{R}_U$.

B. Lower and Upper Restrictions Update 

The restriction sets $\mathcal{R}_L$ and $\mathcal{R}_U$ represent the search space. Thus, they are updated after each
new search by the following rule: an element $A$ is added to the lower (or upper) restriction set
if all elements of $[\emptyset, A]$ (or $[A, W]$) have costs bigger than or equal to that of $A$.

The next theorem establishes the U-curve condition, which permits stopping the chain construction
process and updating the restriction sets.

Theorem 3. Let $C_0, \ldots, C_{k-1}, C_k$ be the chain constructed by Algorithm 2 (or its dual version).
Let $c$ be the cost function from $\mathcal{L}$ to $\mathbb{R}$ decomposable in U-shaped curves and
$c(C_k) > c(C_{k-1})$, then

$\forall A \in \mathcal{L}, \ C_k \subseteq A \Rightarrow c(A) \geq c(C_k)$.

Proof: (in Appendix Section) 
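One way to see why the statement holds, given here as a short sketch that follows directly from the definition of U-shaped decomposability and is not necessarily the argument used in the Appendix:

```latex
Suppose, for contradiction, that some $A \in \mathcal{L}$ satisfies
$C_k \subseteq A$ and $c(A) < c(C_k)$.
Since $C_{k-1} \subseteq C_k \subseteq A$, the three elements form a chain,
and every chain is contained in some maximal chain $\mathcal{M}$ of $\mathcal{L}$.
The restriction of $c$ to $\mathcal{M}$ is U-shaped, hence
\[
  \max\bigl(c(C_{k-1}),\, c(A)\bigr) \;\geq\; c(C_k).
\]
However, $c(C_{k-1}) < c(C_k)$ by hypothesis and $c(A) < c(C_k)$ by assumption,
so $\max\bigl(c(C_{k-1}),\, c(A)\bigr) < c(C_k)$, a contradiction.
Therefore $c(A) \geq c(C_k)$ for every $A \supseteq C_k$.
```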

By a proof similar to that of Theorem 3, it can be shown that all the elements of $\mathcal{L}$
contained in $C_{k-2}$ also have cost bigger than or equal to that of $C_{k-2}$. Figure 3 shows the
chain obtained by the chain construction process and the resulting poset. The highlighted elements
never have cost smaller than that of $C_k = (1 \ldots 11110 \ldots 0)$ or $C_{k-2} = (1 \ldots 1010 \ldots 0)$,
respectively.

Algorithm 5 describes the update process of the lower restriction set by an element $A$. If $A$
is already covered by $\mathcal{R}_L$, i.e., there exists an element of $\mathcal{R}_L$ that contains $A$, then the
process stops (lines 1-3). Otherwise, all the elements of $\mathcal{R}_L$ contained in $A$ are removed from
$\mathcal{R}_L$ and $A$ is added to $\mathcal{R}_L$ (lines 4-9). This procedure may diminish the cardinality of the
restriction set, but removing these restrictions does not return any element to the resulting poset
$X(\mathcal{R}_L)$, since the intervals of the removed restrictions are contained in $[\emptyset, A]$.

The upper restriction list updating procedure is dual to the lower one, i.e., in this case we
look for elements contained in $A$ instead of elements that contain $A$.
Algorithm 5 Update-Lower-Restriction(Element $A$, ElementSet $\mathcal{R}_L$)
1: if there exists $R$ in $\mathcal{R}_L$ where $A \subseteq R$ then
2:    return
3: end if
4: for all $R$ in $\mathcal{R}_L$ do
5:    if $R \subseteq A$ then
6:       $\mathcal{R}_L \leftarrow \mathcal{R}_L \setminus \{R\}$
7:    end if
8: end for
9: $\mathcal{R}_L \leftarrow \mathcal{R}_L \cup \{A\}$
10: return
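The update rule is short enough to express directly in Python. The sketch below assumes integer bitmask subsets and returns a new list instead of mutating the restriction set in place, which is a design choice of this illustration; the function name is hypothetical.

```python
def update_lower_restriction(a, lower):
    """Return the lower restriction list after trying to add element a."""
    if any(a & r == a for r in lower):
        return list(lower)  # a is already covered by some restriction
    kept = [r for r in lower if r & a != r]  # drop restrictions contained in a
    kept.append(a)
    return kept


print(update_lower_restriction(0b110, [0b100, 0b001]))
```

Keeping only restrictions that are not contained in another one means the list stays an antichain, so the coverage tests performed elsewhere scan as few restrictions as possible.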



C. Minimum Exhausting Procedure 

The computation of the cost function is in general heavy. Thus, it is desirable that each element
be visited (and its cost computed) a single time. A way of preventing reprocessing is to
apply the minimum exhausting procedure. This procedure is a recursive function (Algorithm 3).
It uses a stack $S$ to process recursively the whole neighborhood of a given element $M$ contained in
the poset $X(\mathcal{R}_L, \mathcal{R}_U)$. At each recursion, it visits the upper and lower adjacent elements of
$T$, the top of $S$, that are in $X(\mathcal{R}_L, \mathcal{R}_U)$ and not in $S$. The adjacent elements with cost
bigger than the cost of $T$ are elements satisfying the U-curve condition, so they can update the
restriction sets and, consequently, be removed from the search space. The adjacent elements with
cost lower than or equal to that of $T$ are pushed to $S$ to be processed in later iterations. Note
that elements are not reprocessed during the exhausting procedure, since this procedure checks
whether a newly explored element is in an interval or in $S$ before computing its cost. If $T$ is a
minimum exhausted element in $X(\mathcal{R}_L, \mathcal{R}_U)$, then $T$ is removed from $S$. After the whole
procedure is finished, all elements processed are out of the resulting poset
$X(\mathcal{R}_L, \mathcal{R}_U)$, so they will not be reprocessed in the next iterations.
The fact that an element cannot be reprocessed along the procedure implies that the cardinality
of $X(\mathcal{R}_L, \mathcal{R}_U)$ is an upper limit for the number of steps of the procedure. In search spaces
that are lattices of high degree, this procedure may have to process a huge number of elements and
some heuristics may be necessary, for example, stopping the search for adjacent elements smaller
than a minimum after some unsuccessful trials.

Fig. 5. Illustration of error curve oscillation and alternative way.

The minimum exhausting procedure gives another interesting property to the U-curve
algorithm. If the cost function on maximal chains forms U-shaped curves with oscillations, as
illustrated in Figure 5-A, the U-curve algorithm may miss a local minimum element. Note that, in
this case, the local minimum element after the oscillation has cost smaller than that of the one
before it. However, this minimum is not missed if there is another chain, with a true U-shaped cost
function, containing both local minimum elements. Figure 5-B shows an alternative chain (in red)
that reaches the true minimum element of the chain (element in black). Note that the first local
minimum (element in yellow) is contained in both chains. The true minimum, reached by the
alternative chain, is obtained exactly by the exhausting of the first minimum found. Hence,
the exhausting procedure permits relaxing the class of problems approached by the U-curve
algorithm.

V. Experimental Results 

In this section, some results of applications of the U-curve algorithm to feature selection are given
and compared to SFFS [11]. For this study, several data sets were used: W-operator window design
[8], architecture identification in genetic networks and several data sets from the UCI Machine
Learning Repository [1]. In all cases, the value 3 was assigned to the parameter $\delta$ of SFFS.
This parameter is a stop criterion of SFFS. Usually, $0 \leq \delta \leq 3$, in order to prevent the
algorithm from stopping the first time it reaches the desired dimension. In this way, it performs
more feature inclusions and deletions before returning the subset with the desired dimension,
alleviating the nesting effect. The value $\delta = 3$ used as default here is the same default value
adopted by the original algorithm implementation [11].

All data sets used and the binary program with some documentation can be found at the
supplementary material web page (http://www.vision.ime.usp.br/~davidjr/ucurve).

A. Cost function adopted: penalized mean conditional entropy 

Information theory originated from Shannon's work [12] and can be employed in
feature selection problems [5]. Shannon's entropy $H$ is a measure of randomness of a
random variable $Y$ given by:

$$H(Y) = -\sum_{y \in Y} P(y) \log P(y), \qquad (1)$$

in which $P$ is the probability distribution function and, by convention, $0 \cdot \log 0 = 0$.
The conditional entropy is given by the following equation:

$$H(Y | \mathbf{X} = \mathbf{x}) = -\sum_{y \in Y} P(y | \mathbf{X} = \mathbf{x}) \log P(y | \mathbf{X} = \mathbf{x}), \qquad (2)$$

in which $\mathbf{X}$ is a feature vector and $P(Y | \mathbf{X} = \mathbf{x})$ is the conditional probability of $Y$ given the
observation of an instance $\mathbf{x} \in \mathbf{X}$. Finally, the mean conditional entropy of $Y$ given all the
possible instances $\mathbf{x} \in \mathbf{X}$ is given by:

$$E[H(Y | \mathbf{X})] = \sum_{\mathbf{x} \in \mathbf{X}} P(\mathbf{x}) H(Y | \mathbf{x}). \qquad (3)$$

Lower values of $H$ yield better feature subspaces (i.e., the lower $H$, the larger the information
gained about $Y$ by observing $\mathbf{X}$).

In practice, $H(Y)$ and $H(Y | \mathbf{X})$ are estimated. A way to embed the estimation error, committed
by using feature vectors with large dimensions and an insufficient number of samples, is to attribute
a high entropy (i.e., to penalize) to the rarely observed instances. The penalization adopted here
consists in changing the conditional probability distribution of the instances that present just a
single observation to the uniform distribution (i.e., the highest entropy). This makes sense because
if an instance $\mathbf{x}$ has only 1 observation, the value of $Y$ is fully determined (i.e.,
$H(Y | \mathbf{X} = \mathbf{x}) = 0$), but the confidence about the real distribution of $P(Y | \mathbf{X} = \mathbf{x})$ is
very low. Adopting this penalization, the estimation of the mean conditional entropy becomes:

$$E[H(Y | \mathbf{X})] = \frac{N}{t} + \sum_{\mathbf{x} \in \mathbf{X} : P(\mathbf{x}) > \frac{1}{t}} P(\mathbf{x}) H(Y | \mathbf{x}), \qquad (4)$$

in which $t$ is the number of training samples and $N$ is the number of instances with
$P(\mathbf{x}) = \frac{1}{t}$ (i.e., just one observation). In this formula, it is assumed that the logarithm
base is the number of possible classes $|Y|$, thus normalizing the entropy values to the interval
$[0, 1]$. This cost function exhibits U-shaped curves since, for a sufficiently large dimension, the
number of instances with a single observation starts to increase, increasing the penalization and,
consequently, increasing the cost function value (i.e., the next features included do not give
enough information to compensate for the estimation error).
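The penalized estimate can be computed from raw samples as sketched below. This is an illustration under stated assumptions, not the authors' program: each sample is a pair (instance, class label), instances are hashable tuples of the selected feature values, the number of classes is passed explicitly and must be at least 2, and the function name is hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def penalized_mean_conditional_entropy(samples, num_classes):
    """Estimate E[H(Y|X)] with the single-observation penalty, log base |Y|."""
    t = len(samples)
    by_instance = defaultdict(Counter)
    for x, y in samples:
        by_instance[x][y] += 1

    total = 0.0
    for counts in by_instance.values():
        n_x = sum(counts.values())
        if n_x == 1:
            # Penalty: a uniform distribution has entropy 1 in base |Y|,
            # weighted by P(x) = 1/t. Summed over N instances this gives N/t.
            total += 1.0 / t
        else:
            h = -sum((c / n_x) * log(c / n_x, num_classes) for c in counts.values())
            total += (n_x / t) * h
    return total


# Three instances: one ambiguous, one pure, one observed a single time.
samples = [((0,), 0), ((0,), 1), ((1,), 0), ((1,), 0), ((2,), 1)]
value = penalized_mean_conditional_entropy(samples, num_classes=2)
print(round(value, 6))
```

In this toy data set the ambiguous instance contributes $\frac{2}{5} \cdot 1$, the pure instance contributes 0 and the singleton contributes the penalty $\frac{1}{5}$, for a total of 0.6.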

B. Data sets description 

1) W-operator window design: the W-operator window design problem consists in looking
for subsets of a window of size $n$ for which the designed operator has the lowest estimation error
(i.e., the transformed images generated by the operator are as similar as possible to the expected
images). The training samples were obtained from the images presented in [8]. The data set is
composed of 20 files with 18,432 samples each. There are 16 features assuming binary values and two
classes.

2) Biological classification: the biological classification problem studied is the problem of
estimating a subset of predictor genes for a specific target gene from a time-course microarray
experiment. The data set used for the tests is the one presented in [9]. The data are normalized
and quantized in 3 levels using the same method described in [3]. The subset of predictors is
obtained from a set of 27 genes. Thus, there are 27 features assuming three distinct values and
three possible classes. The data set is composed of 10 files with 15 samples each.

3) UCI Machine Learning Repository: the UCI Machine Learning Repository data sets considered
are: pendigits, votes, ionosphere, dorothea_filtered, dexter_filtered, spambase, sonar and madelon.
For all data sets, the feature values were normalized by subtracting their respective
means and dividing by their respective standard deviations. After that, all values were
binarized (i.e., associated to 0 if the normalized value is non-positive, and to 1 otherwise). Except
for dorothea_filtered and dexter_filtered, all features were taken into account. The
dorothea_filtered and dexter_filtered files are post-processed from the dorothea and dexter data
sets, respectively. In the dorothea and dexter data sets, most features display a null value for
almost every sample. So, dorothea_filtered considered only the features with 100 or more non-null
values, while dexter_filtered considered the features with 50 or more non-null values.
A description of each data set is presented in the following list: 

• pendigits: composed of 7494 samples, 16 binary features and 10 classes;
• votes: composed of 435 samples, 16 ternary features and 2 classes;
• ionosphere: composed of 351 samples, 34 binary features and 2 classes;
• dorothea_filtered: composed of 800 samples, 38 binary features and 2 classes;
• dexter_filtered: composed of 300 samples, 48 binary features and 2 classes;
• spambase: composed of 4601 samples, 57 binary features and 2 classes;
• sonar: composed of 208 samples, 60 binary features and 2 classes;
• madelon: composed of 2000 samples, 500 binary features and 2 classes.
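The pre-processing described for the UCI data sets can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names are hypothetical, the data are assumed to be given as a list of numeric rows, and the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def filter_sparse_features(rows, min_nonnull):
    """Return the indices of the columns with at least `min_nonnull` non-zero values."""
    n_cols = len(rows[0])
    return [j for j in range(n_cols)
            if sum(1 for r in rows if r[j] != 0) >= min_nonnull]

def zscore_binarize(rows):
    """Standardize each column, then map non-positive values to 0 and positive values to 1."""
    n_cols = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(n_cols)]
    mus = [mean(c) for c in cols]
    # Assumption: population standard deviation; a constant column keeps sd = 1.
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[1 if (r[j] - mus[j]) / sds[j] > 0 else 0 for j in range(n_cols)]
            for r in rows]

rows = [[0, 5, 1], [0, 7, 3], [2, 9, 5]]
kept = filter_sparse_features(rows, 2)   # columns with 2 or more non-zero values
binary = zscore_binarize(rows)
```

Since the standard deviation is positive, the binarized value depends only on whether a feature value exceeds its column mean, so the choice of estimator does not change the result.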

C. Results 

The feature selection problem may have cost functions with chains that present oscillations,
and there is no theoretical guarantee that alternative chains exist to reach the local minima
lost because of the oscillations. However, these cases were tested experimentally, and in all
observed cases the minimum exhausting procedure found the local minimum elements using
alternative chains. We examined 100,000 random curves for each of the data sets studied. For
example, in the W-operator window design almost 24,000 curves (24%) contain oscillatory parts,
and in the biological classifier design almost 15,000 curves (15%) contain oscillatory parts. For
all these oscillatory curves, and also for those found in the UCI data sets, the minimum
exhausting procedure reached the local minimum through alternative chains.
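A check of this kind can be illustrated with a short sketch. It assumes that a chain curve is U-shaped when no element costs strictly more than both an earlier and a later element of the chain, which is the condition used in the proof of Theorem 3. The sampling scheme (uniformly random maximal chains from the empty set to the full set) and all function names are assumptions made for illustration; the text does not state how the random curves were drawn.

```python
import random

def random_maximal_chain(n, rng):
    """Random maximal chain of the Boolean lattice on n features,
    as n + 1 bitmasks going from the empty set to the full set."""
    order = list(range(n))
    rng.shuffle(order)
    chain, mask = [0], 0
    for j in order:
        mask |= 1 << j
        chain.append(mask)
    return chain

def is_u_shaped(costs):
    """True if costs[j] <= max(costs[i], costs[k]) for all i < j < k."""
    n = len(costs)
    suffix_min = [float("inf")] * (n + 1)   # suffix_min[j] = min(costs[j:])
    for j in range(n - 1, -1, -1):
        suffix_min[j] = min(costs[j], suffix_min[j + 1])
    prefix_min = float("inf")               # min(costs[:j])
    for j in range(n):
        # A value above both an earlier and a later value is an oscillation.
        if prefix_min < costs[j] and suffix_min[j + 1] < costs[j]:
            return False
        prefix_min = min(prefix_min, costs[j])
    return True

def oscillatory_fraction(cost, n, trials, seed=0):
    """Fraction of sampled chains whose cost curve is not U-shaped."""
    rng = random.Random(seed)
    bad = sum(1 for _ in range(trials)
              if not is_u_shaped([cost(m) for m in random_maximal_chain(n, rng)]))
    return bad / trials
```

Here `cost` is any function from a feature-subset bitmask to a number; in the experiments it would be the estimated error of the classifier designed on that subset.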

The results of the U-curve algorithm are divided into two sets: (i) until it beats the SFFS
result (UC); (ii) until the search space is completely processed (UCC). The U-curve algorithm
is stochastic, and in each test it can reach the best result in a different processing time. So,
the U-curve algorithm was run 5 times for each test, and the quantitative results presented are
the means of the values obtained in these 5 runs. The machine used for the tests was an AMD
Turion 64 with 2 GB of RAM.

In the following, each of the three experiments performed is summarized by a table and all 
these tables have the same structure. The first column presents the winner of the comparison of 
SFFS with UC. The other columns present the cost in terms of processed nodes and computational 
time of SFFS, UC and UCC. 

Table I shows the results for the W-operator window design experiment. Twenty tests were
performed using the available training samples. UC beats SFFS in 8 of the 20 tests and reaches
the same result in the remaining ones. In these remaining cases, both reach the global minimum
element. In all cases, UC processes fewer nodes, in less time, than SFFS. The complete search
(UCC) frequently needs to process more nodes (17/20 tests) and takes more time (19/20 tests)
than SFFS.

Table II shows the results for the biological classifier design experiment. Ten tests were
performed using different target genes. In these examples, the complete search space is quite
large ($2^{27}$ nodes). SFFS reaches the best element, equalling UC, in only 3 of the 10 tests.
Processing the whole space (UCC) improved the result of UC in 7 of the 10 tests. UC processed
many more nodes than SFFS, but their computational times are very similar. This happens because
these experiments involve a small number of samples and, therefore, the computational time spent
to process a node is very small. The pre-processing overhead accounts for most of the running
time in this case.
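For scale, the size of this search space can be compared with the number of nodes that the complete search computes. Taking 30,724, the UCC node count of test 2 in Table II, as an example, and reading "computed nodes" as the nodes whose cost was actually evaluated (an interpretation, not a statement from the text):

```latex
\[
2^{27} = 134{,}217{,}728,
\qquad
\frac{30{,}724}{2^{27}} \approx 2.3 \times 10^{-4}
\]
```

Under that reading, the complete search in this test evaluates about 0.023% of the lattice, and the remaining nodes are discarded by the cuts.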

Table III shows the results of 8 tests using public data sets. For each test, the value in
parentheses is the number of features (n) in the data set. For tests with a high number of
features, the results for the complete search (UCC) are not available. UC obtained better results
than SFFS in 6 of the 8 tests and equal results in the two tests with a small number of features.
In these two cases, SFFS reaches the best result, but UC reaches it faster, processing fewer nodes.

These results show that UC is more efficient than SFFS for low-order problems, obtaining the
same results with less processing. For high-order problems, UC is more accurate, but in some
cases it processes more nodes and takes more time.

TABLE I

Comparison between SFFS and U-curve results for the W-operator window design.

VI. Conclusion

This paper introduces a new combinatorial problem, the Boolean U-curve optimization problem,
and presents a stochastic branch-and-bound solution for it, the U-curve algorithm. This
algorithm gives the optimal elements of a cost function decomposable in U-shaped chains, which
may even be oscillatory in a given sense. This model makes it possible to describe the feature
selection problem in the context of pattern recognition. Thus, the U-curve algorithm constitutes
a new tool to approach feature selection problems.

TABLE II

Comparison between SFFS and U-curve results for the biological classification design.

                Computed nodes                      Time (sec)
Test  Winner    SFFS         UC     UCC            SFFS         UC           UCC
1     EQUAL     [illegible]  777    [illegible]    [illegible]  [illegible]  [illegible]
2     UC        135          9,252  30,724         0.5          2.1          11.2
3     UC        135          1,037  9,410          0.5          0.6          3.1
4     UC        164          786    9,276          0.5          0.6          3.1
5     UC        281          247    6,126          0.5          0.6          1.5
6     EQUAL     135          2,675  11,031         0.5          0.7          7.3
7     EQUAL     135          998    10,836         0.5          0.6          6.9
8     UC        135          463    5,381          0.5          0.5          1.5
9     UC        135          246    4,226          0.5          0.5          1.5
10    UC        191          474    8,930          0.5          0.5          2.9

The U-curve algorithm exploits the particular structures of the domain and of the cost
function. The Boolean nature of the domain makes it possible to represent the search space by a
collection of upper and lower restrictions. At each iteration, a beginning-of-chain node is
computed from the search space restrictions. The current explored chain is constructed from this
node by choosing upper or lower adjacent nodes. The choice of a beginning of chain and of an
adjacent node usually has several options, and one of them is taken randomly. The cost function
and the domain structure make it possible to cut the search space when a local minimum is found
in a chain. After a local minimum is found, all local minimum nodes connected to it are computed
by the minimum exhausting procedure, and the corresponding cuts, by up-down intervals, are
executed. The adjacency and connectivity relations adopted are those of the Hasse diagram of the
search space, which is a graph in which the connectivity is induced by the partial order relation.
The minimum exhausting procedure prevents a node from being visited more than once and
generalizes the algorithm to cost functions decomposable in some class of U-shaped oscillatory
chain functions. The procedures of the U-curve algorithm are supported by formal results.
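The adjacency relation of the Hasse diagram mentioned in this paragraph can be made concrete with a small sketch. It assumes that each node is a subset of n features encoded as an integer bitmask; the function names are hypothetical and the code is an illustration, not the implementation used in the experiments.

```python
import random

def upper_adjacent(mask, n):
    """Nodes obtained from `mask` by adding exactly one of the n features."""
    return [mask | (1 << j) for j in range(n) if not mask & (1 << j)]

def lower_adjacent(mask):
    """Nodes obtained from `mask` by removing exactly one feature."""
    out, m = [], mask
    while m:
        low = m & -m            # lowest set bit of m
        out.append(mask ^ low)  # drop that feature from the original node
        m ^= low
    return out

def random_upper_step(mask, n, rng):
    """Pick one upper adjacent node at random, or None at the top of the lattice."""
    options = upper_adjacent(mask, n)
    return rng.choice(options) if options else None
```

For example, with n = 4 the node 0101 has the upper adjacent nodes 0111 and 1101 and the lower adjacent nodes 0100 and 0001. A chain is built by repeating such steps, and the random choice among the options is what makes the algorithm stochastic.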

In fact, the U-curve optimization technique constitutes a new framework to study a family of
optimization problems. The representation by restrictions and the cuts by intervals, based on
Boolean lattice properties, constitute a new optimization structure for combinatorial problems,
with properties not found in conventional tree representations.

TABLE III

Comparison between SFFS results and U-curve algorithm for the UCI Machine Learning Repository
data sets.

The U-curve algorithm was applied to practical problems and compared to SFFS. The experiments
involved window operator design, genetic network identification and eight public data sets
obtained from the UCI repository. In all experiments, the results of the U-curve algorithm were
equal to or better than those obtained from SFFS in precision and, in many cases, also in
performance. The results of the U-curve algorithm considered for comparison are the means of
several executions for the same input data, since it is a stochastic algorithm whose performance
may differ at each run.

The efficiency of the U-curve algorithm depends on the relative position of the local minima 
on the search space. The algorithm is more efficient when the local minima are near the search 
space extremities. The worst cases are the ones in which the local minima are near the middle 
of the lattice. 

The results obtained so far are encouraging, but the present version of the U-curve
algorithm is not a fast solution for high-dimensional problems with many local minima in the
center of the search space lattice. Addressing these problems efficiently within the U-curve
optimization approach opens a number of subjects for future research, such as: developing
additional cuts for the branch-and-bound formulation; designing and estimating distributions for
the random parameters used in the choice of beginning nodes or adjacent paths in the construction
of a chain, with the goal of reaching the best nodes earlier; building parallelized versions of
the algorithm; and others.



October 30, 2008 



DRAFT 



A BRANCH-AND-BOUND OPTIMIZATION ALGORITHM FOR U-SHAPED COST FUNCTIONS ON BOOLEAN LATTICES 




Appendix 

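Theorem 1. For every $A \in \mathcal{L}$,

\[
A \in \mathcal{X}(\mathcal{R}_L) \iff A \cap R^c \neq \emptyset, \ \forall R \in \mathcal{R}_L .
\]

Proof:

\[
\begin{aligned}
A \in \mathcal{X}(\mathcal{R}_L)
  &\iff A \in \mathcal{L} - \bigcup \{[\emptyset, R] : R \in \mathcal{R}_L\} \\
  &\iff A \notin \bigcup \{[\emptyset, R] : R \in \mathcal{R}_L\} \\
  &\iff A \notin [\emptyset, R], \ \forall R \in \mathcal{R}_L \\
  &\iff A \not\subseteq R, \ \forall R \in \mathcal{R}_L \\
  &\iff A \cap R^c \neq \emptyset, \ \forall R \in \mathcal{R}_L
\end{aligned}
\]

□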
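Theorem 1 turns membership in the restricted search space into one bitwise test per lower restriction. The sketch below illustrates this and compares it with the definition by intervals. It assumes that subsets are encoded as integer bitmasks; the function names are hypothetical.

```python
def covered_by_lower_interval(a, r):
    """True if A lies in the interval [empty set, R], i.e. A is a subset of R."""
    return (a & r) == a

def in_space_by_definition(a, lower_restrictions):
    """A belongs to X(R_L) if no interval [empty set, R] contains it."""
    return not any(covered_by_lower_interval(a, r) for r in lower_restrictions)

def in_space_by_theorem1(a, lower_restrictions):
    """Theorem 1: A belongs to X(R_L) iff A meets the complement of every R."""
    return all((a & ~r) != 0 for r in lower_restrictions)

restrictions = [0b0011, 0b0100]
agree = all(in_space_by_definition(a, restrictions) == in_space_by_theorem1(a, restrictions)
            for a in range(16))
```

With the restrictions 0011 and 0100, the node 0001 is removed from the search space because it is contained in 0011, while 0101 remains because it has an element outside each restriction.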

Theorem 2. The element $C$ of $\mathcal{X}(\mathcal{R}_L)$ returned by the minimal construction
process (Algorithm 4) is a minimal element in $\mathcal{X}(\mathcal{R}_L)$.

Proof: By looking into the steps of the minimal construction procedure:

• Lines 7-15 guarantee that at any step of the procedure the resulting $C$ is contained in
$\mathcal{X}(\mathcal{R}_L)$, i.e., it is updated only when the resulting $C$ satisfies the
condition shown in Theorem 1.

• Let $C_1, \ldots, C_n$ be the sequence of resulting elements at each step $i$
($i = 1, \ldots, n$) and $C_0 = \underbrace{1 \ldots 1}_{n}$ be the initial element. As an index
$k$ is chosen to be removed from $C_{i-1}$ (lines 4-6) at each step $i$, it implies that
$C_n \subseteq C_{n-1} \subseteq \ldots \subseteq C_0$.

• Proving that the resulting element $C_n$ is minimal in $\mathcal{X}(\mathcal{R}_L)$ is
equivalent to proving that $\forall l \in C_n, \ C_n \setminus \{l\} \notin \mathcal{X}(\mathcal{R}_L)$.

• Let $k = l$, $l \in C_n$, and let $i$ be the step of the procedure at which the index $l$ is
chosen to be removed from $C_{i-1}$. $C_n \subseteq C_i$ and $l \in C_n$ imply that $l \in C_i$,
i.e., $l$ cannot be removed from $C_{i-1}$ at the end of step $i$. This removal is avoided by the
algorithm (lines 8-12) when there exists an element $R \in \mathcal{R}_L$ with
$R^c \cap (C_{i-1} \setminus \{l\}) = \emptyset$. As
$C_n \setminus \{l\} \subseteq C_{i-1} \setminus \{l\}$, then
$R^c \cap (C_n \setminus \{l\}) = \emptyset$ and, by Theorem 1,
$C_n \setminus \{l\} \notin \mathcal{X}(\mathcal{R}_L)$. This implies that $C_n$ is a minimal
element in $\mathcal{X}(\mathcal{R}_L)$.

□
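The procedure analysed in this proof can be sketched as a greedy removal loop. The sketch is reconstructed from the description in the proof only and is not a transcription of Algorithm 4: it assumes that the initial element is the full set, that each index is tried once in random order, and that a removal is kept only when the Theorem 1 test still holds. All names are hypothetical.

```python
import random

def in_space(a, lower_restrictions):
    """Theorem 1 test: A meets the complement of every lower restriction."""
    return all((a & ~r) != 0 for r in lower_restrictions)

def minimal_element(n, lower_restrictions, rng=None):
    """Return a minimal element of X(R_L) as a bitmask, or None if X(R_L) is empty."""
    rng = rng or random.Random()
    c = (1 << n) - 1                      # assumed initial element: the full set
    if not in_space(c, lower_restrictions):
        return None                       # the full set is a restriction, nothing is left
    order = list(range(n))
    rng.shuffle(order)                    # random choice of the index to remove
    for k in order:
        candidate = c & ~(1 << k)
        if in_space(candidate, lower_restrictions):
            c = candidate                 # keep the removal only if still in X(R_L)
    return c

def is_minimal(c, lower_restrictions):
    """C is in X(R_L) and removing any single element of C leaves X(R_L)."""
    bits = [j for j in range(c.bit_length()) if c & (1 << j)]
    return in_space(c, lower_restrictions) and all(
        not in_space(c & ~(1 << j), lower_restrictions) for j in bits)
```

With the restrictions 0011 and 1100 on four features, every run returns a node with one feature from each half, for example 1010, whatever the random order.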

Theorem 3. Let $C_0, \ldots, C_{k-1}, C_k$ be the chain constructed by Algorithm 2 (or its dual
version). Let $c$ be the cost function from $\mathcal{L}$ to $\mathbb{R}$ decomposable in
U-shaped curves and $c(C_k) > c(C_{k-1})$. It is true that

\[
\forall A \in \mathcal{L}, \ C_k \subseteq A \Rightarrow c(A) \geq c(C_k).
\]

Proof: Suppose that $\exists B \in \mathcal{L}$, $C_k \subseteq B$ and $c(B) < c(C_k)$. It
contradicts the hypothesis that $c$ is a function decomposable in U-shaped curves, since
$C_{k-1} \subseteq C_k \subseteq B$, but $\max(c(C_{k-1}), c(B))$ is either
$c(C_{k-1}) < c(C_k)$ or $c(B) < c(C_k)$, contradicting
$\max(c(C_{k-1}), c(B)) \geq c(C_k)$. □
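The use of Theorem 3 as a pruning rule can be illustrated with a short sketch. It is not Algorithm 2: the order in which features are added is passed in explicitly, subsets are bitmasks, and the names are hypothetical. The walk stops at the first strict cost increase and returns both the last node before the increase ($C_{k-1}$) and the node where the cost went up ($C_k$); by Theorem 3, every superset of $C_k$ costs at least $c(C_k)$.

```python
def climb_until_increase(start, order, cost):
    """Walk up a chain from `start`, adding the features in `order` one at a time.
    Returns (best, blocker): best is C_{k-1} and blocker is C_k,
    or (top, None) if the cost never increases strictly."""
    current, current_cost = start, cost(start)
    for j in order:
        if current & (1 << j):
            continue                      # feature already present
        nxt = current | (1 << j)
        nxt_cost = cost(nxt)
        if nxt_cost > current_cost:
            return current, nxt           # first strict increase: stop here
        current, current_cost = nxt, nxt_cost
    return current, None

def is_pruned(a, blockers):
    """True if A contains some blocker C_k, so that c(A) >= c(C_k) by Theorem 3."""
    return any((a & b) == b for b in blockers)

# Toy cost that is U-shaped on every chain: distance of the subset size from 2.
toy_cost = lambda m: abs(bin(m).count("1") - 2)
best, blocker = climb_until_increase(0, [0, 1, 2, 3], toy_cost)
```

In this toy run the walk visits 0000, 0001, 0011 and 0111, so `best` is 0011 and `blocker` is 0111; the supersets of 0111 can then be skipped without evaluating their cost.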